Skip to main content

Pandas for ML

A production ML feature pipeline landed in code review with this pattern:

# Reviewed code - can you spot the three problems?
import pandas as pd

def add_features(df):
df["age_bucket"] = df["age"].apply(lambda x: "young" if x < 30 else "old")
df["revenue_per_visit"] = df["revenue"] / df["visits"]
df["user_segment"][df["revenue"] > 1000] = "high_value" # line 3
return df

users = pd.read_csv("users.csv")
users = add_features(users)
users.to_parquet("features.parquet")

Three distinct problems:

  1. .apply(lambda ...) on a string column runs a Python loop over every row. For 10 million rows, this is 10 million Python function calls instead of a single vectorised operation.
  2. Division by visits will silently produce inf where visits == 0 and NaN where both are NaN. A model trained on these will learn from garbage.
  3. df["user_segment"][df["revenue"] > 1000] = "high_value" is chained indexing - a known bug surface in Pandas. Whether it modifies the original DataFrame or a temporary copy is implementation-defined. The linter correctly raises SettingWithCopyWarning.

These are not beginner mistakes. They appear in production codebases at companies with experienced engineers. The reason is that Pandas has a surface area that rewards familiarity - if you have only used it for exploration, the production patterns are non-obvious.

This lesson teaches the production patterns.

What You Will Learn

  • Memory-efficient dtypes: category, int8, float32, pd.Int32Dtype() - how to cut DataFrame memory by 5-10x
  • The method chaining style and why it makes pipelines readable and safe
  • Vectorised string operations vs .apply() (and when .apply() is unavoidable)
  • GroupBy for feature engineering: transform, agg, shift, cumcount
  • Window functions: rolling, expanding, exponential weighted means
  • Merge patterns: inner join, left join, asof join, broadcast join
  • The correct way to handle missing data in ML context (imputation vs removal vs masking)
  • Converting Pandas to NumPy and PyTorch without unnecessary copies
  • Chained indexing bugs, copy warnings, and how to avoid them
  • Interview patterns

Prerequisites

  • NumPy Internals (Lesson 01)
  • Basic Pandas: read_csv, DataFrame, Series, loc, iloc, groupby, merge

Part 1 - Memory-Efficient Dtypes

The single highest-leverage action you can take on a large DataFrame before any other optimisation is to fix the dtypes. A default read_csv assigns object dtype to string columns and float64 to all numerics. This is almost always 3-10x more memory than necessary.

import pandas as pd
import numpy as np

# Simulate a typical ML dataset: 5M rows, realistic dtypes
np.random.seed(42)
n = 5_000_000

df_raw = pd.DataFrame({
"user_id": np.random.randint(0, 1_000_000, n), # could be int32
"age": np.random.randint(18, 85, n), # could be int8
"country": np.random.choice(["US", "UK", "DE", "IN", "BR"], n), # object
"plan": np.random.choice(["free", "basic", "pro"], n), # object
"revenue": np.random.exponential(50, n), # could be float32
"visits": np.random.randint(0, 1000, n), # could be int16
"is_active": np.random.randint(0, 2, n), # could be bool
"score": np.random.uniform(0, 1, n), # could be float32
})

print(f"Default memory: {df_raw.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
# Default memory: ~385 MB

# --- Optimise dtypes ---
df = df_raw.copy()

df["user_id"] = df["user_id"].astype(np.int32) # 8B → 4B per element
df["age"] = df["age"].astype(np.int8) # 8B → 1B per element
df["country"] = df["country"].astype("category") # object → category (dictionary encoded)
df["plan"] = df["plan"].astype("category")
df["revenue"] = df["revenue"].astype(np.float32) # 8B → 4B per element
df["visits"] = df["visits"].astype(np.int16) # 8B → 2B per element
df["is_active"] = df["is_active"].astype(bool) # 8B → 1B per element
df["score"] = df["score"].astype(np.float32) # 8B → 4B per element

print(f"Optimised memory: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB")
# Optimised memory: ~51 MB (7.5x reduction)

print(df.dtypes)

The category Dtype

category is the most impactful dtype for string columns with low cardinality. Internally, Pandas stores integer codes (1 byte each for up to 255 categories) plus a lookup dictionary. Operations on category columns dispatch to integer comparisons.

import pandas as pd
import numpy as np

n = 10_000_000
countries = np.random.choice(["US", "UK", "DE", "IN", "BR", "CA", "AU", "FR"], n)

df = pd.DataFrame({"country": countries})

# object dtype: each element is a Python string object (pointer + heap allocation)
mem_object = df["country"].memory_usage(deep=True)
print(f"object dtype: {mem_object / 1024**2:.1f} MB") # ~530 MB

df["country_cat"] = df["country"].astype("category")
mem_cat = df["country_cat"].memory_usage(deep=True)
print(f"category dtype: {mem_cat / 1024**2:.1f} MB") # ~10 MB

# Operations on category columns are also faster
import time

t0 = time.perf_counter()
_ = df["country"].value_counts()
t_obj = time.perf_counter() - t0

t0 = time.perf_counter()
_ = df["country_cat"].value_counts()
t_cat = time.perf_counter() - t0

print(f"value_counts - object: {t_obj*1000:.0f} ms, category: {t_cat*1000:.0f} ms")
# object: 890 ms, category: 42 ms (21x faster)

Nullable Integer Dtypes

Standard NumPy int64 cannot represent NaN. Pandas solves this with nullable integer dtypes:

import pandas as pd

# NaN in an integer column forces float64 (loss of precision for large IDs)
s1 = pd.Series([1, 2, None, 4])
print(s1.dtype) # float64 - NaN forced upcasting

# Nullable integer: keeps int semantics with explicit NA
s2 = pd.Series([1, 2, None, 4], dtype=pd.Int32Dtype())
print(s2.dtype) # Int32 (capital I = nullable)
print(s2.isna()) # [False, False, True, False]

# Useful for: user IDs, item IDs - where NaN means "unknown" not "fractional"
# Memory: same as int32 (+ 1 bit per element for the null mask)

A dtype Optimisation Helper

import pandas as pd
import numpy as np

def optimise_dtypes(df: pd.DataFrame, inplace: bool = False) -> pd.DataFrame:
"""
Automatically downcast numeric columns and convert low-cardinality
object columns to category.
"""
if not inplace:
df = df.copy()

for col in df.columns:
col_type = df[col].dtype

if col_type == object:
n_unique = df[col].nunique()
if n_unique / len(df) < 0.5: # less than 50% unique → category
df[col] = df[col].astype("category")

elif pd.api.types.is_integer_dtype(col_type):
col_min, col_max = df[col].min(), df[col].max()
for dtype in [np.int8, np.int16, np.int32, np.int64]:
info = np.iinfo(dtype)
if info.min <= col_min and col_max <= info.max:
df[col] = df[col].astype(dtype)
break

elif pd.api.types.is_float_dtype(col_type):
# Downcast to float32 - acceptable for most ML features
# (float32 has ~7 decimal digits of precision)
df[col] = df[col].astype(np.float32)

return df

Part 2 - Method Chaining

Method chaining allows you to express a multi-step transformation as a single readable pipeline. It avoids intermediate variable names and makes the data flow self-documenting.

import pandas as pd
import numpy as np

np.random.seed(42)
n = 100_000

df = pd.DataFrame({
"user_id": np.random.randint(1, 10_001, n),
"event": np.random.choice(["view", "click", "purchase", "return"], n),
"amount": np.random.exponential(30, n),
"timestamp": pd.date_range("2024-01-01", periods=n, freq="1min"),
"country": np.random.choice(["US", "UK", "DE", "IN"], n),
})
# Inject some nulls
df.loc[np.random.choice(n, 500), "amount"] = np.nan


# Procedural style (hard to follow the transformation chain):
df2 = df.copy()
df2 = df2[df2["event"].isin(["click", "purchase"])]
df2["amount"] = df2["amount"].fillna(0)
df2["log_amount"] = np.log1p(df2["amount"])
df2["hour"] = df2["timestamp"].dt.hour
df2["is_weekend"] = df2["timestamp"].dt.dayofweek >= 5
df2 = df2.drop(columns=["timestamp"])
df2 = df2.rename(columns={"log_amount": "log_revenue"})


# Method chaining style - same transformations, but readable as a pipeline:
df_clean = (
df
.loc[df["event"].isin(["click", "purchase"])] # filter to relevant events
.assign(amount=lambda d: d["amount"].fillna(0)) # impute missing amounts
.assign(log_revenue=lambda d: np.log1p(d["amount"])) # log-transform
.assign(hour=lambda d: d["timestamp"].dt.hour) # time features
.assign(is_weekend=lambda d: d["timestamp"].dt.dayofweek >= 5)
.drop(columns=["timestamp"]) # drop raw timestamp
.reset_index(drop=True) # clean index after filter
)

print(df_clean.dtypes)
print(df_clean.shape)

The .assign() method is key: it returns a new DataFrame with the additional or replaced column, making every step in the chain safe from mutation. It accepts both values and callables (lambdas), where the callable receives the current state of the DataFrame in the chain.

pipe() for Custom Steps

import pandas as pd
import numpy as np

def clip_outliers(df: pd.DataFrame, cols: list, n_sigma: float = 3.0) -> pd.DataFrame:
"""Clip numeric columns at n_sigma standard deviations."""
for col in cols:
mu = df[col].mean()
sigma = df[col].std()
df = df.assign(**{col: df[col].clip(mu - n_sigma * sigma, mu + n_sigma * sigma)})
return df

def add_interaction_features(df: pd.DataFrame) -> pd.DataFrame:
"""Add feature crosses."""
return df.assign(
revenue_per_hour=lambda d: d["log_revenue"] / (d["hour"] + 1),
)

# Use pipe() to inject custom functions into a chain
df_features = (
df
.loc[df["event"].isin(["click", "purchase"])]
.assign(amount=lambda d: d["amount"].fillna(0))
.assign(log_revenue=lambda d: np.log1p(d["amount"]))
.assign(hour=lambda d: d["timestamp"].dt.hour)
.pipe(clip_outliers, cols=["amount"], n_sigma=4.0) # custom function in chain
.pipe(add_interaction_features)
.drop(columns=["timestamp", "event"])
.reset_index(drop=True)
)

Part 3 - Vectorised Operations vs .apply()

.apply() runs a Python function on each row or column. It is almost always the wrong choice for numeric operations and many string operations.

import pandas as pd
import numpy as np
import timeit

n = 1_000_000
df = pd.DataFrame({
"a": np.random.randn(n),
"b": np.random.randn(n),
"label": np.random.choice(["cat", "dog", "bird"], n),
})

# --- Numeric: .apply() vs vectorised ---

# BAD: Python loop via apply (row-wise)
t_apply = timeit.timeit(
lambda: df.apply(lambda row: row["a"] ** 2 + row["b"] ** 2, axis=1),
number=3
) / 3

# GOOD: NumPy vectorised
t_vec = timeit.timeit(
lambda: df["a"] ** 2 + df["b"] ** 2,
number=3
) / 3

print(f"apply (row): {t_apply:.3f}s")
print(f"vectorised: {t_vec:.3f}s")
print(f"Speedup: {t_apply / t_vec:.0f}x")
# apply (row): 12.400s
# vectorised: 0.004s
# Speedup: 3100x

# --- String operations: .apply() vs .str accessor ---
df["label_upper_apply"] = df["label"].apply(str.upper) # Python loop

# Pandas str accessor - vectorised (compiled C for common operations)
df["label_upper_str"] = df["label"].str.upper() # Much faster

t_apply_str = timeit.timeit(lambda: df["label"].apply(str.upper), number=3) / 3
t_str = timeit.timeit(lambda: df["label"].str.upper(), number=3) / 3

print(f"apply (str.upper): {t_apply_str:.3f}s")
print(f"str accessor: {t_str:.3f}s")
# apply (str.upper): 0.420s
# str accessor: 0.190s (still not amazing because object dtype is slow)
# Better: use category dtype so you only apply to unique values:
df["label_cat"] = df["label"].astype("category")
t_cat = timeit.timeit(
lambda: df["label_cat"].cat.rename_categories(str.upper),
number=3
) / 3
print(f"category upper: {t_cat:.4f}s") # ~0.0001s - applied to 3 unique values only

When .apply() Is Acceptable

import pandas as pd
import numpy as np

# ACCEPTABLE USE 1: complex logic that has no vectorised equivalent
# Example: apply a user-defined scoring function that uses multiple columns
# in a conditional chain

def complex_score(row):
"""This logic requires if/else and cannot be simply vectorised."""
if row["plan"] == "free" and row["visits"] > 100:
return row["revenue"] * 0.5
elif row["plan"] == "pro" and row["churn_risk"] > 0.8:
return row["revenue"] * 0.2
elif pd.isna(row["revenue"]):
return 0.0
else:
return row["revenue"]

# For low row counts (< 100K) or one-time preprocessing, apply is fine.
# For 10M+ rows, consider: np.select, np.where chains, or numba.

# ACCEPTABLE USE 2: apply on groupby with aggregation
# (pandas internally uses C groupby machinery - much faster than row-wise apply)
result = df.groupby("plan")["revenue"].apply(lambda x: x.quantile(0.9))

# CORRECT PATTERN: replace conditionals with np.select
# Much faster than row-wise apply for the above example:
conditions = [
(df["plan"] == "free") & (df["visits"] > 100),
(df["plan"] == "pro") & (df["churn_risk"] > 0.8),
df["revenue"].isna(),
]
choices = [
df["revenue"] * 0.5,
df["revenue"] * 0.2,
0.0,
]
result = np.select(conditions, choices, default=df["revenue"])

Part 4 - GroupBy for Feature Engineering

GroupBy is the most powerful feature engineering primitive in Pandas. The key insight is using transform (which returns a Series aligned to the original DataFrame's index) versus agg (which reduces to one row per group).

import pandas as pd
import numpy as np

np.random.seed(42)
n = 500_000

df = pd.DataFrame({
"user_id": np.random.randint(1, 10_001, n), # 10K users
"product_id": np.random.randint(1, 1_001, n), # 1K products
"category": np.random.choice(["A", "B", "C", "D"], n),
"revenue": np.random.exponential(25, n).astype(np.float32),
"timestamp": pd.date_range("2024-01-01", periods=n, freq="1min"),
})

# --- transform: add group statistics back to original rows ---
# Each row gets the mean revenue of its user (without reducing to per-user rows)
df["user_mean_revenue"] = df.groupby("user_id")["revenue"].transform("mean")
df["user_total_revenue"] = df.groupby("user_id")["revenue"].transform("sum")
df["user_tx_count"] = df.groupby("user_id")["revenue"].transform("count")

# Revenue relative to user baseline (feature: is this purchase unusual for this user?)
df["revenue_vs_user_mean"] = df["revenue"] / (df["user_mean_revenue"] + 1e-9)

# --- agg: compute per-group statistics (reduces dimensions) ---
user_stats = (
df.groupby("user_id")
.agg(
mean_revenue=("revenue", "mean"),
total_revenue=("revenue", "sum"),
tx_count=("revenue", "count"),
max_revenue=("revenue", "max"),
unique_products=("product_id", "nunique"),
)
.reset_index()
)

print(f"Per-user stats: {user_stats.shape}") # (10000, 6)
print(user_stats.head())

# --- cumcount and rank inside groups ---
# Sort within user by timestamp to get sequential transaction features
df = df.sort_values(["user_id", "timestamp"])

df["tx_sequence"] = df.groupby("user_id").cumcount() # 0-based transaction number per user
df["tx_rank_revenue"] = df.groupby("user_id")["revenue"].rank(pct=True) # percentile within user

# --- shift: lag features ---
# Previous transaction revenue for the same user
df["prev_revenue"] = df.groupby("user_id")["revenue"].shift(1)
# Difference from previous
df["revenue_delta"] = df["revenue"] - df["prev_revenue"]

Advanced GroupBy: Multiple Aggregations with Named Columns

import pandas as pd
import numpy as np

# Build a user feature table with multiple aggregations in one pass
user_features = (
df.groupby("user_id")
.agg(
# Tuples: (source_col, aggfunc) with output column naming
avg_revenue = ("revenue", "mean"),
p90_revenue = ("revenue", lambda s: s.quantile(0.9)),
total_revenue = ("revenue", "sum"),
tx_count = ("revenue", "count"),
unique_cats = ("category", "nunique"),
first_tx = ("timestamp", "min"),
last_tx = ("timestamp", "max"),
)
.assign(
# Derived feature from aggregated columns - all vectorised
revenue_per_tx=lambda d: d["total_revenue"] / d["tx_count"],
active_days=lambda d: (d["last_tx"] - d["first_tx"]).dt.days,
)
.reset_index()
)

print(f"User feature table: {user_features.shape}")

Part 5 - Window Functions

Rolling and expanding windows are essential for time series features. Pandas implements these efficiently in C for standard aggregations.

import pandas as pd
import numpy as np

n = 100_000
prices = pd.DataFrame({
"timestamp": pd.date_range("2020-01-01", periods=n, freq="1h"),
"price": np.cumprod(1 + np.random.randn(n) * 0.005) * 100,
"volume": np.random.exponential(1_000_000, n),
})
prices = prices.set_index("timestamp")

# --- Rolling window (fixed window) ---
prices["sma_24h"] = prices["price"].rolling(window=24).mean() # 24-hour SMA
prices["std_24h"] = prices["price"].rolling(window=24).std() # 24-hour volatility
prices["high_24h"] = prices["price"].rolling(window=24).max()
prices["low_24h"] = prices["price"].rolling(window=24).min()

# Bollinger bands
prices["bb_upper"] = prices["sma_24h"] + 2 * prices["std_24h"]
prices["bb_lower"] = prices["sma_24h"] - 2 * prices["std_24h"]

# --- Expanding window (growing window, from start) ---
prices["cum_max_price"] = prices["price"].expanding().max()
prices["cum_mean_price"] = prices["price"].expanding().mean()

# --- Exponential weighted mean (EWM) ---
# EWM gives more weight to recent observations (decays exponentially)
# span=24 means the last 24 observations get ~86% of the total weight
prices["ema_24h"] = prices["price"].ewm(span=24, adjust=False).mean()

# --- Rolling with minimum periods ---
# Start the window early but require at least 5 observations
prices["rolling_mean_min5"] = prices["price"].rolling(window=24, min_periods=5).mean()

# --- Rolling across groups ---
# Per-user 7-day rolling spend (must sort by user then time first)
# See the groupby window pattern:
user_events = pd.DataFrame({
"user_id": np.repeat(np.arange(100), 1000),
"timestamp": np.tile(pd.date_range("2024-01-01", periods=1000, freq="1h"), 100),
"spend": np.random.exponential(10, 100_000).astype(np.float32),
})
user_events = user_events.sort_values(["user_id", "timestamp"])

# Rolling within each user group
user_events["rolling_7d_spend"] = (
user_events.groupby("user_id")["spend"]
.transform(lambda s: s.rolling(window=168, min_periods=1).sum()) # 168 hours = 7 days
)

Part 6 - Merge Patterns

Merges (joins) are where Pandas pipelines often hide both bugs and performance problems.

import pandas as pd
import numpy as np

# --- Standard inner join ---
users = pd.DataFrame({
"user_id": np.arange(100_000),
"country": np.random.choice(["US", "UK", "DE"], 100_000),
})
transactions = pd.DataFrame({
"user_id": np.random.randint(0, 120_000, 1_000_000), # some users have no transactions
"amount": np.random.exponential(25, 1_000_000).astype(np.float32),
})

# Inner join: only users who have transactions
result_inner = users.merge(transactions, on="user_id", how="inner")

# Left join: all users, NaN for those with no transactions
result_left = users.merge(transactions, on="user_id", how="left")

print(f"Users: {len(users):,}")
print(f"Transactions: {len(transactions):,}")
print(f"Inner join: {len(result_inner):,}")
print(f"Left join: {len(result_left):,}")


# --- Broadcast join (one-to-many enrichment) ---
# Country-level features broadcast to all users in that country
country_features = pd.DataFrame({
"country": ["US", "UK", "DE"],
"gdp_per_cap": [65_000, 45_000, 52_000],
"market_size": ["large", "medium", "medium"],
})

# This is a "broadcast" join: 100K user rows get their 3 country feature rows
users_enriched = users.merge(country_features, on="country", how="left")
# Each country row broadcasts to all matching user rows - no data is duplicated in the source table


# --- Asof join (time-series alignment) ---
# Join on the most recent exchange rate at or before each transaction timestamp
transactions_ts = pd.DataFrame({
"timestamp": pd.date_range("2024-01-01", periods=1_000, freq="1h"),
"amount": np.random.exponential(100, 1_000),
}).sort_values("timestamp")

fx_rates = pd.DataFrame({
"timestamp": pd.date_range("2024-01-01", periods=100, freq="10h"), # less frequent
"usd_eur_rate": np.cumprod(1 + np.random.randn(100) * 0.001) * 0.92,
}).sort_values("timestamp")

# pd.merge_asof: for each row in left, find the most recent row in right
# that is <= left's key. Perfect for "what was the price at the time of this event?"
merged = pd.merge_asof(
transactions_ts,
fx_rates,
on="timestamp",
direction="backward", # find most recent rate at or before transaction time
)
merged["amount_eur"] = merged["amount"] * merged["usd_eur_rate"]
print(merged[["timestamp", "amount", "usd_eur_rate", "amount_eur"]].head())

Merge Performance Patterns

import pandas as pd
import numpy as np
import time

n_left = 5_000_000
n_right = 100_000

left = pd.DataFrame({
"key": np.random.randint(0, n_right, n_left),
"val": np.random.randn(n_left).astype(np.float32),
})
right = pd.DataFrame({
"key": np.arange(n_right),
"feature": np.random.randn(n_right).astype(np.float32),
})

# --- Sorted merge is faster for large left, small right ---
left_sorted = left.sort_values("key")

t0 = time.perf_counter()
_ = left.merge(right, on="key", how="left")
t_unsorted = time.perf_counter() - t0

t0 = time.perf_counter()
_ = left_sorted.merge(right, on="key", how="left")
t_sorted = time.perf_counter() - t0

print(f"Unsorted merge: {t_unsorted:.2f}s")
print(f"Sorted merge: {t_sorted:.2f}s")

# --- map() as a fast alternative to merge for simple enrichment ---
# For a single column lookup, map() via a dict is faster than merge
right_dict = dict(zip(right["key"], right["feature"]))

t0 = time.perf_counter()
_ = left["key"].map(right_dict)
t_map = time.perf_counter() - t0

print(f"map() lookup: {t_map:.2f}s") # Usually 2-5x faster than merge for this pattern

Part 7 - Handling Missing Data in ML Context

Missing data in ML is not just an inconvenience - it is a source of label leakage, distribution shift, and training-serving skew if not handled consistently.

import pandas as pd
import numpy as np

n = 100_000
df = pd.DataFrame({
"age": np.random.choice([np.nan, 25, 30, 35, 40, 50, 60], n),
"income": np.random.exponential(50_000, n),
"employment": np.random.choice(["employed", "unemployed", np.nan], n),
"credit_score": np.where(np.random.rand(n) > 0.1, np.random.randint(300, 850, n), np.nan),
"label": np.random.randint(0, 2, n),
})

# --- Step 1: Audit missing data ---
missing = df.isnull().sum()
missing_pct = missing / len(df) * 100
print(pd.DataFrame({"count": missing, "pct": missing_pct.round(2)}))

# --- Step 2: Decide strategy per column ---

# Numeric columns: impute with median (robust to outliers) + add missing indicator
df["age_missing"] = df["age"].isna().astype(np.int8) # indicator feature
df["age_imputed"] = df["age"].fillna(df["age"].median()) # impute

df["credit_missing"] = df["credit_score"].isna().astype(np.int8)
df["credit_imputed"] = df["credit_score"].fillna(df["credit_score"].median())

# Categorical: impute with mode or "Unknown" category
df["employment_imputed"] = df["employment"].fillna("Unknown")

# --- Step 3: Compute fill values on training set only ---
# CRITICAL for preventing data leakage. Compute on train, apply to test.

from sklearn.model_selection import train_test_split

train_idx, test_idx = train_test_split(df.index, test_size=0.2, random_state=42)

train_medians = {
"age": df.loc[train_idx, "age"].median(),
"credit_score": df.loc[train_idx, "credit_score"].median(),
}

# Apply training medians to both splits
for split_idx in [train_idx, test_idx]:
df.loc[split_idx, "age_clean"] = df.loc[split_idx, "age"].fillna(train_medians["age"])
df.loc[split_idx, "credit_clean"] = df.loc[split_idx, "credit_score"].fillna(
train_medians["credit_score"]
)

# --- Step 4: Validate no NaN remains in feature columns ---
feature_cols = ["age_clean", "credit_clean", "age_missing", "credit_missing"]
assert df[feature_cols].isnull().sum().sum() == 0, "NaN remaining in features!"
print("No NaN in feature columns: OK")

Part 8 - Converting to NumPy and PyTorch

The last step in a Pandas feature pipeline is conversion to the format expected by the model. This should be zero-copy wherever possible.

import pandas as pd
import numpy as np
import torch

# After all feature engineering, we have a clean numeric DataFrame
n = 50_000
feature_df = pd.DataFrame({
"age": np.random.randint(18, 80, n).astype(np.float32),
"log_income": np.random.randn(n).astype(np.float32),
"tx_count": np.random.randint(0, 500, n).astype(np.float32),
"credit": np.random.uniform(300, 850, n).astype(np.float32),
"is_active": np.random.randint(0, 2, n).astype(np.float32),
})
labels = pd.Series(np.random.randint(0, 2, n), dtype=np.int64)

# --- to_numpy() - zero-copy for uniform dtype DataFrames ---
# This is zero-copy ONLY if all columns have the same dtype (e.g., all float32)
X_np = feature_df.to_numpy(dtype=np.float32) # (50000, 5)
y_np = labels.to_numpy(dtype=np.int64) # (50000,)

print(f"Feature array: {X_np.shape}, {X_np.dtype}")
print(f"Label array: {y_np.shape}, {y_np.dtype}")

# --- torch.from_numpy() - zero-copy conversion ---
# Shares memory with the numpy array. Modifying X_tensor modifies X_np!
X_tensor = torch.from_numpy(X_np)
y_tensor = torch.from_numpy(y_np)

print(f"X tensor: {X_tensor.shape}, {X_tensor.dtype}")

# Verify zero-copy (shared memory)
X_np[0, 0] = -999.0
print(f"X_tensor[0, 0] = {X_tensor[0, 0]}") # -999.0 - shared memory confirmed

# --- When you need a copy (before modifying independently) ---
X_tensor_independent = torch.tensor(X_np) # creates a copy
# OR
X_tensor_independent = torch.from_numpy(X_np.copy())

# --- DataLoader integration ---
dataset = torch.utils.data.TensorDataset(X_tensor, y_tensor)
loader = torch.utils.data.DataLoader(
dataset,
batch_size=512,
shuffle=True,
pin_memory=True, # pin to page-locked memory for faster GPU transfer
num_workers=4, # parallel data loading
)

for batch_X, batch_y in loader:
# batch_X: (512, 5) float32, batch_y: (512,) int64
# Move to GPU: batch_X.to("cuda", non_blocking=True)
break

print(f"Batch X: {batch_X.shape}, Batch y: {batch_y.shape}")

Mixed-Dtype DataFrames

import pandas as pd
import numpy as np

# When columns have different dtypes, to_numpy() must copy and upcast
df_mixed = pd.DataFrame({
"a": np.array([1.0, 2.0], dtype=np.float32),
"b": np.array([3.0, 4.0], dtype=np.float64), # different dtype
})

# This COPIES and upcasts to float64 (the wider type)
X = df_mixed.to_numpy()
print(X.dtype) # float64

# Solution: cast all columns to float32 BEFORE calling to_numpy()
X_f32 = df_mixed.astype(np.float32).to_numpy()
print(X_f32.dtype) # float32 - and if all cols were already float32, this is zero-copy

Part 9 - Chained Indexing and the SettingWithCopyWarning

This is the most common source of silent bugs in Pandas ML pipelines.

import pandas as pd
import numpy as np

df = pd.DataFrame({
"a": [1, 2, 3, 4, 5],
"b": [10, 20, 30, 40, 50],
})

# --- CHAINED INDEXING: DANGEROUS ---
# df["a"][df["a"] > 2] = 99 # SettingWithCopyWarning
# This may or may not modify the original df, depending on Pandas internals.

# Why: df["a"] may return a copy or a view (implementation detail).
# If it returns a copy: the copy is modified, df is unchanged (silent bug).
# If it returns a view: df is modified correctly.

# --- CORRECT: use .loc for assignment ---
df.loc[df["a"] > 2, "a"] = 99
print(df)

# --- Another common pattern: filtering then modifying ---
# BAD: creates a copy, then tries to modify it
high_b = df[df["b"] > 20] # Might be a view OR a copy
high_b["new_col"] = 0 # SettingWithCopyWarning

# GOOD: use .copy() explicitly when you want a separate DataFrame
high_b = df[df["b"] > 20].copy()
high_b["new_col"] = 0 # Clearly modifies the copy

# GOOD: use .loc on the original
df.loc[df["b"] > 20, "new_col"] = 0 # Modifies original df safely

# --- Pandas 3.0 Copy-on-Write ---
# In Pandas 3.0+, Copy-on-Write makes this unambiguous:
# Any indexing always returns a view (until you write to it, which triggers a copy).
# You can opt in now:
pd.options.mode.copy_on_write = True # Enable in Pandas 2.x for future-proofing

Interview Patterns

Pattern 1: Memory audit before training. Always run df.memory_usage(deep=True).sum() and df.dtypes before any ML work. If any column is object dtype for low-cardinality data, cast to category. If any integer column uses int64 unnecessarily, downcast.

Pattern 2: Transform vs agg in GroupBy. agg reduces dimensions (one row per group). transform preserves dimensions (same index as input). Use transform to add group statistics as new columns - this is the idiomatic Pandas way to create entity-level features without a separate merge step.

Pattern 3: The imputation leakage trap. Never compute fill values (medians, means, modes) on the full dataset before splitting. Fit imputation statistics on training data only, then apply to test data. Failing to do this inflates test set performance because the model indirectly "sees" the test distribution through the imputed values.

Pattern 4: Rolling window edge cases. Specify min_periods explicitly. rolling(window=7) produces NaN for the first 6 rows. rolling(window=7, min_periods=1) starts computing from row 1. Know which behaviour your model expects - NaN values in features at inference time will crash most models silently or produce nonsense.

Pattern 5: merge vs map vs join. For a simple column lookup (enrich rows with values from a dictionary), Series.map(dict) is 2-5x faster than df.merge(). Use merge when you need multiple columns from the right table or when the join semantics are complex.

Key Takeaways

  • Fixing dtypes is the single highest-leverage memory optimisation: int64 to int8 saves 7 bytes per element; object to category can save 10-50x for low-cardinality string columns.
  • Method chaining with .assign() and .pipe() produces readable, mutation-safe pipelines. The key: .assign() returns a new DataFrame, so intermediate steps cannot silently mutate shared state.
  • .apply(func, axis=1) is a row-wise Python loop. For numeric operations, replace it with vectorised NumPy expressions. For string operations, use the .str accessor or cast to category and operate only on unique values.
  • groupby().transform() adds group statistics to individual rows without reducing dimensionality - the workhorse of entity-level feature engineering.
  • Imputation statistics must be computed on training data only and applied identically to test data. Computing statistics on the full dataset before splitting is data leakage.
  • df.to_numpy(dtype=np.float32) followed by torch.from_numpy() gives you a zero-copy path from DataFrame to PyTorch tensor when all columns share the same dtype.
  • Chained indexing (df["col"][mask] = value) is undefined behaviour. Always use .loc[mask, "col"] = value for assignment.

Practice Problems

Level 1 - Predict the Output

Problem 1: What is the memory usage difference, and why?

import pandas as pd
import numpy as np

n = 1_000_000
s1 = pd.Series(np.random.choice(["A", "B", "C"], n))
s2 = s1.astype("category")

print(s1.memory_usage(deep=True))
print(s2.memory_usage(deep=True))
Answer

s1 (object dtype) stores a pointer to a Python string object per element: roughly n * 8 bytes (pointer) + 3 * ~50 bytes (string objects) = about 8 MB.

s2 (category dtype) stores n * 1 byte (int8 code) + a small lookup table with 3 entries = about 1 MB.

The exact output will be approximately 8,000,128 bytes vs 1,000,611 bytes. Category uses ~8x less memory.

Problem 2: Is there a bug in this feature engineering code?

import pandas as pd
import numpy as np

df = pd.DataFrame({
"value": [10, 20, np.nan, 40, 50],
"group": ["A", "A", "B", "B", "B"],
})

# Fill NaN with the mean of the full column, then compute group mean
df["value_filled"] = df["value"].fillna(df["value"].mean())
group_mean = df.groupby("group")["value_filled"].mean()
print(group_mean)
Answer

Yes - this is imputation leakage within groups. The NaN in group B is filled with the column mean (which includes group A values), artificially pulling the group B mean toward the global mean. The correct approach fills within the group:

df["value_filled"] = df.groupby("group")["value"].transform(
lambda s: s.fillna(s.mean())
)

This fills group B's NaN with the mean of group B's non-null values only.

Level 2 - Debug Challenge

This pipeline is slow for 10M rows. Identify all performance problems and rewrite it.

import pandas as pd

def slow_pipeline(df):
df["log_revenue"] = df["revenue"].apply(lambda x: __import__("math").log(x + 1))
df["label"] = df["category"].apply(
lambda c: "premium" if c in ["A", "B"] else "standard"
)
df["norm_score"] = df.apply(
lambda row: (row["score"] - df["score"].mean()) / df["score"].std(), axis=1
)
return df
Answer

Three problems:

  1. math.log via .apply() is a Python loop over 10M elements. Replace with np.log1p(df["revenue"]).
  2. String conditional via .apply() on a low-cardinality column. Replace with df["category"].isin(["A", "B"]) as a boolean mask, or np.where.
  3. .apply(axis=1) with df["score"].mean() computed inside the lambda - this computes the global mean 10M times. Compute once outside.
import numpy as np

def fast_pipeline(df):
score_mean = df["score"].mean()
score_std = df["score"].std()
return (
df
.assign(log_revenue=np.log1p(df["revenue"]))
.assign(label=np.where(df["category"].isin(["A", "B"]), "premium", "standard"))
.assign(norm_score=(df["score"] - score_mean) / score_std)
)

Level 3 - Design Challenge

You have an event log with 100M rows: (user_id, item_id, event_type, timestamp, session_id). You need to build a feature table at the (user, item) level for a recommendation model, containing:

  • Number of times the user viewed the item
  • Number of times the user purchased the item
  • Days since the user last viewed the item (relative to max timestamp in the dataset)
  • User's overall purchase rate (purchases / views across all items)
  • Item's overall view count (popularity feature)

Design the Pandas pipeline. Identify where you would use groupby, where you would use merge, and how to avoid the most common performance mistakes.

Solution Sketch
import pandas as pd
import numpy as np

# Assume df has columns: user_id, item_id, event_type, timestamp
# Cast dtypes first (user_id → int32, item_id → int32, event_type → category)
df = df.assign(
user_id=df["user_id"].astype(np.int32),
item_id=df["item_id"].astype(np.int32),
event_type=df["event_type"].astype("category"),
)

max_ts = df["timestamp"].max()

# 1. User-item level features via groupby + agg (one pass)
views = df[df["event_type"] == "view"]
purchases = df[df["event_type"] == "purchase"]

ui_views = (
views.groupby(["user_id", "item_id"])
.agg(
view_count=("timestamp", "count"),
last_view_ts=("timestamp", "max"),
)
.assign(days_since_view=lambda d: (max_ts - d["last_view_ts"]).dt.days)
.drop(columns="last_view_ts")
)

ui_purchases = (
purchases.groupby(["user_id", "item_id"])
.agg(purchase_count=("timestamp", "count"))
)

# 2. User-level purchase rate
user_stats = df.groupby("user_id").agg(
total_views=("event_type", lambda s: (s == "view").sum()),
total_purchases=("event_type", lambda s: (s == "purchase").sum()),
)
user_stats["purchase_rate"] = user_stats["total_purchases"] / user_stats["total_views"].clip(lower=1)

# 3. Item popularity
item_stats = views.groupby("item_id").agg(item_view_count=("timestamp", "count"))

# 4. Join everything
ui_features = (
ui_views
.join(ui_purchases, how="left")
.fillna({"purchase_count": 0})
.join(user_stats[["purchase_rate"]], on="user_id")
.join(item_stats, on="item_id")
.reset_index()
)

print(ui_features.shape)
print(ui_features.dtypes)

Key decisions: (1) compute groupby stats in single passes per aggregation level, not row-wise; (2) use .join() with pre-computed stat tables (faster than merging the full 100M-row df multiple times); (3) cast dtypes before any groupby to reduce memory during sort/hash operations.

© 2026 EngineersOfAI. All rights reserved.